Dynamic Normalization and Relay for Video Action Recognition
Convolutional Neural Networks (CNNs) have been the dominant models for video action recognition. Due to their huge memory and compute demands, popular action recognition networks must be trained with small batch sizes, which makes learning discriminative spatial-temporal representations for videos challenging. In this paper, we present Dynamic Normalization and Relay (DNR), an improved normalization design, to augment the spatial-temporal representation learning of any deep action recognition model under small-batch training settings. We observe that state-of-the-art action recognition networks usually apply the same normalization parameters to all video data, ignoring the dependencies of the estimated normalization parameters between neighboring frames (at the same layer) and between neighboring layers (over all frames of a video clip). Motivated by this, DNR introduces two dynamic normalization relay modules that exploit cross-temporal and cross-layer feature distribution dependencies to estimate accurate layer-wise normalization parameters. Both DNR modules are instantiated as light-weight recurrent structures conditioned on the current input features and on the normalization parameters estimated either from neighboring-frame features at the same layer or from whole-clip features at the preceding layers. We first plug DNR into prevailing 2D CNN backbones and test its performance on public action recognition datasets including Kinetics and Something-Something. Experimental results show that DNR brings large performance improvements to the baselines, achieving over 4.4% absolute margins in top-1 accuracy without training bells and whistles.
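The cross-temporal relay idea can be illustrated with a toy sketch. This is not the paper's actual module: the function name, the exponential-moving-average relay, and the `alpha` parameter are all illustrative assumptions. The point it demonstrates is that per-frame normalization statistics are carried forward to neighboring frames instead of each frame being normalized independently.

```python
import numpy as np

def dynamic_norm_relay(frames, eps=1e-5, alpha=0.9):
    """Toy cross-temporal normalization relay (illustrative sketch only).

    frames: list of (C, H, W) arrays belonging to one video clip.
    Per-frame channel statistics are relayed to the next frame via an
    exponential moving update, so neighboring frames share distribution
    information rather than being normalized independently.
    """
    mu, var = None, None
    out = []
    for x in frames:
        m = x.mean(axis=(1, 2), keepdims=True)  # per-channel mean, shape (C, 1, 1)
        v = x.var(axis=(1, 2), keepdims=True)   # per-channel variance
        if mu is None:
            mu, var = m, v                       # first frame: no history yet
        else:
            # relay: condition the current statistics on the neighboring frame's
            mu = alpha * mu + (1 - alpha) * m
            var = alpha * var + (1 - alpha) * v
        out.append((x - mu) / np.sqrt(var + eps))
    return out
```

The paper's modules additionally relay across layers and learn the update from the input features; the fixed-momentum update here only illustrates the cross-temporal direction.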
- North America > Canada > Quebec > Montreal (0.04)
- Europe > France > Hauts-de-France > Pas-de-Calais (0.04)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.96)
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.86)
Appendices for the Paper "pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning"
Sec. A: details of the adopted datasets and models (e.g., tasks, heterogeneous partitions, and FL processes), as shown in Figure 4 in the main body of the paper. CIFAR-10 is a popular dataset for 10-class image classification containing 60,000 colored images with a resolution of 32x32 pixels. Its train/valid/test sets have a ratio of about 3:1:5. For another of the adopted datasets, the train/valid/test ratio is about 14:3:3. For all these experimental datasets, we randomly select 20% of clients as new clients that do not participate in the FL processes.
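The held-out new-client selection described above can be sketched as follows. The function name, the seeding scheme, and the return convention are assumptions for illustration, not the benchmark's actual code.

```python
import random

def split_new_clients(client_ids, new_frac=0.2, seed=0):
    """Randomly hold out a fraction of clients as 'new' clients that never
    participate in the FL process (illustrative sketch).

    Returns (participating_clients, new_clients).
    """
    rng = random.Random(seed)      # fixed seed for a reproducible split
    ids = list(client_ids)
    rng.shuffle(ids)
    k = int(len(ids) * new_frac)   # e.g., 20% of clients become new clients
    return ids[k:], ids[:k]
```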
Find A Winning Sign: Sign Is All We Need to Win the Lottery
Oh, Junghun, Baik, Sungyong, Lee, Kyoung Mu
The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch. The common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network. However, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones. In this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network. Through linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters. Interestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original. The code is available at https://github.com/JungHunOh/AWS_ICLR2025.git.
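The sparsity-and-sign transfer idea, and the linear path used in linear mode connectivity analysis, can be sketched as follows. These function names and the magnitude-from-init convention are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def apply_sign_and_mask(random_init, trained, mask):
    """Transfer sparsity and parameter-sign information to a random init
    (illustrative sketch): keep the random init's magnitudes, force the
    trained network's signs, and zero out pruned weights."""
    return np.abs(random_init) * np.sign(trained) * mask

def linear_path(theta_a, theta_b, n=5):
    """Points on the linear path between two parameter vectors, as used in
    linear mode connectivity analysis (the error barrier is the worst loss
    along such a path)."""
    return [(1 - t) * theta_a + t * theta_b for t in np.linspace(0.0, 1.0, n)]
```

Evaluating the loss at each point of `linear_path` between two trained networks is the standard way to measure whether they share a basin of attraction.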
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Contests & Prizes (1.00)
- Research Report > New Finding (0.46)
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Akinwande, Victor, Norouzzadeh, Mohammad Sadegh, Willmott, Devin, Bair, Anna, Ganesh, Madan Ravi, Kolter, J. Zico
Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can generate new zero-shot deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead. A now-standard approach in deep learning for vision tasks is to first pre-train a model on web-scale data and then adapt this model for a specific task using little or no additional data. Despite the widespread success of these models and their lack of reliance on large-scale labeled datasets, a significant downside is that these models are often on the order of billions of parameters, much larger than their supervised counterparts for a given task at the same accuracy level. While these pre-trained models are powerful due to their generality, practitioners still need to apply them to well-defined and specific tasks. We consider settings where there are additional constraints on the size of these models such as in edge computing applications.
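The hypernetwork mechanism can be illustrated with a toy sketch: a single linear map turns a pooled text embedding into the weights of a small linear image encoder, so each new task gets adapted encoder weights from one forward pass on the text side. All names, shapes, and the linear form here are illustrative assumptions, not HyperCLIP's actual architecture.

```python
import numpy as np

def generate_encoder_weights(text_emb, W_hyper, d_out, d_img):
    """Toy 'hypernetwork' (illustrative sketch): map a pooled text embedding
    to the weight matrix of a small linear image encoder."""
    flat = W_hyper @ text_emb          # one forward pass on the text side
    return flat.reshape(d_out, d_img)  # task-adapted encoder weights

def encode_image(weights, x_img):
    """Apply the generated, task-adapted weights to an image feature vector."""
    return weights @ x_img
```

In the real model the hypernetwork and both encoders are deep networks trained jointly end-to-end; the sketch only shows the data flow that makes zero-shot classifier generation a single text-side forward pass.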
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)